-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5929][PYSPARK] Context addPyPackage and addPyRequirements #12398
Conversation
So looking back on the predecessor PR it seemed like @davies suggested adding support for specifying these at runtime with |
I think having this supported could be very useful :) |
import pip | ||
for req in pip.req.parse_requirements(path, session=uuid.uuid1()): | ||
if not req.check_if_exists(): | ||
pip.main(['install', req.req.__str__()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it seems that this can sometimesrequire elevated privileges based on the issues with the previous jenkins run. What about if at startup we created a fixed temp directory per context adding it to our path with sys.path.insert(0, self.pipBase)
and at install did something along the lines of:
pip.main(['install', req.req.__str__(), '--target', self.pipBase])
so that we don't have to have write permissions to the default pip target?
Since @andrewor14 got Jenkins to run the earlier iteration of this PR maybe you could do the same here? |
I could see the pipBase approach working. The other testing blocker I have is that it seems like that context that's getting created in the test is able to use local packages. The test package that's created is able to be imported on the workers without distributing it via addPyPackage. https://github.com/apache/spark/pull/12398/files#diff-388a156f4ce454fe98d7a99a0f7f0012R1950 Is there a way to get the mocked workers to not use the master python path? The pip part is nice, but I think the meat of this PR is in the addPyPackage. I looked into adding a flag to the CLI, but kind of went down a rabbit hole trying to figure it out. I'd advocate adding it to the CLI, but think someone working in the spark core space could get that integrated much more efficiently than myself. |
ok to test |
Test build #2967 has finished for PR 12398 at commit
|
self.assertSequenceEqual([0, 3, 6, 9], trips.collect()) | ||
finally: | ||
shutil.rmtree(name) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the extra empty line
@buckhx These API seems useful, could you also add an argument for bin/spark-submit (only for requirement file) ? |
Added an addPyPackage example to the docstring and that test formatting. We're looking into adding the spark-submit arg for --py-requirements. The pip API takes a file path via pip.req.parse_requirements https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L64 but it looks like it might be possible to take a string and break it out line by with a mock file name via pip.req.process_line & pip.req.preprocess https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L110 |
@buckhx Even the pip API only takes a file path, we could write the as temporary file, I think. |
add py-requirements submit option
@robert3005 added the --py-requirements to spark-submit. I'm looking at passing the requirements as a list or a line delimited string currently. |
changed addRequirementsFile to addPyRequirements which takes a list of requirement strings |
@davies how's this looking? |
Test build #3068 has finished for PR 12398 at commit
|
@buckhx any interest in updating this against master? |
@buckhx Just following up to see if this is something you are still interested in working on? |
ping @buckhx |
What changes were proposed in this pull request?
Context.addPyPackage()
Context.addPyRequirements()
Both of these methods take a package on the master and ship it to the workers when called instead of having to manually install packages on all workers.
How was this patch tested?
Unit tests are written, but I do not believe they accurately reflect a distributed environment. The test_add_py_package is not using addPyPackage and still works. The addRequirementsFile method requires internet access to hit the global pypi server and won't work on the current Jenkins build system.
We have had this patch running at Palantir for about a year in production.